Domain: Telecom
Context:
A telecom company wants to use its historical customer data to predict behaviour and retain
customers. You can analyse all relevant customer data and develop focused customer retention
programs.
Data Description:
Each row represents a customer; each column contains a customer attribute, described in the
column metadata. The dataset includes information about:
● Customers who left within the last month – the column is called Churn
● Services that each customer has signed up for – phone, multiple lines, internet, online security, online
backup, device protection, tech support, and streaming TV and movies
● Customer account information – how long they’ve been a customer, contract, payment method,
paperless billing, monthly charges, and total charges
● Demographic info about customers – gender, age range, and if they have partners and dependents
Project Objective:
Build a model that will help identify the customers who have a higher probability of churning.
This helps the company understand the pain points and patterns of customer churn and sharpen its
focus on customer retention strategies.
def dependencies():
    global pd,np,re,time,warnings
    global px,make_subplots,go
    global train_test_split, metrics
    global StandardScaler
    global xgb,GridSearchCV,RandomizedSearchCV
    import pandas as pd
    import numpy as np
    from plotly import express as px
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go
    import re
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RandomizedSearchCV
    import xgboost as xgb
    import time
    import warnings
    pd.set_option('display.max_columns', None)  # display all columns without truncating
    warnings.filterwarnings('ignore')  # mute warnings
dependencies()
Steps and Tasks:
1. Data Understanding and Exploration:
a. Read ‘TelcomCustomer-Churn_1.csv’ as a DataFrame and assign it to a variable.
b. Read ‘TelcomCustomer-Churn_2.csv’ as a DataFrame and assign it to a variable.
c. Merge both the DataFrames on key ‘customerID’ to form a single DataFrame.
d. Verify that all the columns are incorporated in the merged DataFrame by using a simple
comparison operator in Python.
# read csv files to d1 & d2
d1 = pd.read_csv(r"TelcomCustomer-Churn_1.csv")
d2 = pd.read_csv(r"TelcomCustomer-Churn_2.csv")
# review both dataframes
print("dataframe1\nshape:",d1.shape)
display(d1.sample(5))
print("\n\ndataframe2\nshape:",d2.shape)
display(d2.sample(5))
dataframe1 shape: (7043, 10)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4861 | 6941-KXRRV | Female | 1 | Yes | No | 48 | Yes | No | DSL | No |
| 1137 | 0831-JNISG | Male | 0 | Yes | Yes | 71 | Yes | No | No | No internet service |
| 2866 | 7517-LDMPS | Female | 0 | No | No | 12 | Yes | No | Fiber optic | No |
| 5245 | 5902-WBLSE | Female | 0 | Yes | Yes | 70 | Yes | No | No | No internet service |
| 1876 | 3946-JEWRQ | Male | 0 | Yes | No | 47 | Yes | Yes | Fiber optic | No |
dataframe2 shape: (7043, 12)
| customerID | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2638 | 6769-DCQLI | Yes | Yes | Yes | Yes | Yes | One year | Yes | Bank transfer (automatic) | 105.00 | 5426.85 | Yes |
| 2671 | 4191-XOVOM | Yes | No | Yes | Yes | Yes | Month-to-month | No | Electronic check | 105.40 | 6713.2 | No |
| 2133 | 8051-HJRLT | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.55 | 70.55 | Yes |
| 6429 | 6332-FBZRI | Yes | Yes | Yes | No | No | One year | Yes | Credit card (automatic) | 69.35 | 4653.25 | No |
| 5942 | 7240-ETPTR | No | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check | 48.75 | 442.2 | Yes |
Both datasets contain different attributes of the customers.
Let's check the customerID column for commonality of records between the two datasets.
d1.customerID.describe()
count 7043 unique 7043 top 8012-SOUDQ freq 1 Name: customerID, dtype: object
d2.customerID.describe()
count 7043 unique 7043 top 8012-SOUDQ freq 1 Name: customerID, dtype: object
Both datasets have 7043 unique records.
Let's check whether they describe the same set of customers.
(d1.customerID==d2.customerID).value_counts()
True 7043 Name: customerID, dtype: int64
The customerIDs of both datasets match each other in the same sequence,
so a very simple merge will be sufficient.
df=pd.merge(d1,d2,on='customerID',how='outer') # merge on customerID
df.head() # review new dataframe
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
# verify merger of columns
c1=list(d1.columns) # columns of dataset 1
c2=list(d2.drop('customerID',axis=1).columns) # columns of dataset 2 except customerID
list(df.columns)==c1+c2 # compare columns with merged columns using equality operator
True
2. Data Cleaning and Analysis :
a. Impute missing/unexpected values in the DataFrame.
# nulls counter
def nulsCount(df):
    # dependency: import pandas as pd
    d2 = pd.DataFrame(columns=["NULL","NAN"])
    d2["NULL"] = df.isnull().sum().astype('uint32')
    d2["NAN"] = df.isna().sum().astype('uint32')
    d2 = d2.loc[(d2["NULL"]!=0) | (d2["NAN"]!=0)]
    if d2.shape[0]==0:
        print("no NULLs/NANs")
        return
    else:
        display(d2)
        return d2.sum()
nulsCount(df)
no NULLs/NANs
Since the DataFrame does not (technically) contain any null values,
let's review the contents of each column to understand it better.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.2+ MB
The attributes have varying datatypes.
Let's look in detail at the unique values and their distribution in each column.
for col in df.drop('customerID',axis=1).columns:
    display(df[col].value_counts().sort_index(ascending=True))
Female 3488 Male 3555 Name: gender, dtype: int64
0 5901 1 1142 Name: SeniorCitizen, dtype: int64
No 3641 Yes 3402 Name: Partner, dtype: int64
No 4933 Yes 2110 Name: Dependents, dtype: int64
0 11
1 613
2 238
3 200
4 176
...
68 100
69 95
70 119
71 170
72 362
Name: tenure, Length: 73, dtype: int64
No 682 Yes 6361 Name: PhoneService, dtype: int64
No 3390 No phone service 682 Yes 2971 Name: MultipleLines, dtype: int64
DSL 2421 Fiber optic 3096 No 1526 Name: InternetService, dtype: int64
No 3498 No internet service 1526 Yes 2019 Name: OnlineSecurity, dtype: int64
No 3088 No internet service 1526 Yes 2429 Name: OnlineBackup, dtype: int64
No 3095 No internet service 1526 Yes 2422 Name: DeviceProtection, dtype: int64
No 3473 No internet service 1526 Yes 2044 Name: TechSupport, dtype: int64
No 2810 No internet service 1526 Yes 2707 Name: StreamingTV, dtype: int64
No 2785 No internet service 1526 Yes 2732 Name: StreamingMovies, dtype: int64
Month-to-month 3875 One year 1473 Two year 1695 Name: Contract, dtype: int64
No 2872 Yes 4171 Name: PaperlessBilling, dtype: int64
Bank transfer (automatic) 1544 Credit card (automatic) 1522 Electronic check 2365 Mailed check 1612 Name: PaymentMethod, dtype: int64
18.25 1
18.40 1
18.55 1
18.70 2
18.75 1
..
118.20 1
118.35 1
118.60 2
118.65 1
118.75 1
Name: MonthlyCharges, Length: 1585, dtype: int64
11
100.2 1
100.25 1
100.35 1
100.4 1
..
997.75 1
998.1 1
999.45 1
999.8 1
999.9 1
Name: TotalCharges, Length: 6531, dtype: int64
No 5174 Yes 1869 Name: Churn, dtype: int64
Almost every feature has well-defined values.
Let us confirm the same below using table filters.
display(df[["MultipleLines","PhoneService"]].value_counts().to_frame())
| 0 | ||
|---|---|---|
| MultipleLines | PhoneService | |
| No | Yes | 3390 |
| Yes | Yes | 2971 |
| No phone service | No | 682 |
Clearly, all the records with "No phone service" in MultipleLines are associated with the "No" class of the PhoneService attribute,
hence they are not an anomaly.
Let's check the attributes containing "No internet service" records.
col_of_interest=[]
for col in df.columns:
    txt=df[col].value_counts().index
    for ind in txt:
        # search for the keyword using a regex search
        if re.search('internet', str(ind)):
            col_of_interest.append(col)
for col in col_of_interest:
    display(df[[col,"InternetService"]].value_counts().to_frame())
| 0 | ||
|---|---|---|
| OnlineSecurity | InternetService | |
| No | Fiber optic | 2257 |
| No internet service | No | 1526 |
| No | DSL | 1241 |
| Yes | DSL | 1180 |
| Yes | Fiber optic | 839 |
| 0 | ||
|---|---|---|
| OnlineBackup | InternetService | |
| No | Fiber optic | 1753 |
| No internet service | No | 1526 |
| Yes | Fiber optic | 1343 |
| No | DSL | 1335 |
| Yes | DSL | 1086 |
| 0 | ||
|---|---|---|
| DeviceProtection | InternetService | |
| No | Fiber optic | 1739 |
| No internet service | No | 1526 |
| Yes | Fiber optic | 1357 |
| No | DSL | 1356 |
| Yes | DSL | 1065 |
| 0 | ||
|---|---|---|
| TechSupport | InternetService | |
| No | Fiber optic | 2230 |
| No internet service | No | 1526 |
| No | DSL | 1243 |
| Yes | DSL | 1178 |
| Yes | Fiber optic | 866 |
| 0 | ||
|---|---|---|
| StreamingTV | InternetService | |
| Yes | Fiber optic | 1750 |
| No internet service | No | 1526 |
| No | DSL | 1464 |
| No | Fiber optic | 1346 |
| Yes | DSL | 957 |
| 0 | ||
|---|---|---|
| StreamingMovies | InternetService | |
| Yes | Fiber optic | 1751 |
| No internet service | No | 1526 |
| No | DSL | 1440 |
| No | Fiber optic | 1345 |
| Yes | DSL | 981 |
Similarly, all the records with "No internet service" are associated with the "No" class of the InternetService attribute,
hence they too are not an anomaly.
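These dependency checks can also be expressed programmatically. Below is a minimal sketch with a hypothetical helper (`check_dependent_service`, not part of the notebook) run on a toy frame rather than the real dataset: it asserts that a placeholder class in a service column appears exactly on the rows where its parent column is "No".

```python
import pandas as pd

def check_dependent_service(df, col, parent_col, placeholder):
    """True iff `placeholder` in `col` appears exactly on the rows
    where `parent_col` equals "No" (hypothetical sanity-check helper)."""
    mask = df[col] == placeholder
    forward = (df.loc[mask, parent_col] == "No").all()
    backward = (df.loc[df[parent_col] == "No", col] == placeholder).all()
    return bool(forward and backward)

# tiny illustrative frame (not the project data)
toy = pd.DataFrame({
    "PhoneService":  ["Yes", "No", "Yes"],
    "MultipleLines": ["Yes", "No phone service", "No"],
})
check_dependent_service(toy, "MultipleLines", "PhoneService", "No phone service")  # True
```

The same helper could be looped over every "No internet service" column against InternetService.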
Let's review the blank fields of TotalCharges.
df.loc[df["tenure"]==0,["tenure","TotalCharges"]]
| tenure | TotalCharges | |
|---|---|---|
| 488 | 0 | |
| 753 | 0 | |
| 936 | 0 | |
| 1082 | 0 | |
| 1340 | 0 | |
| 3331 | 0 | |
| 3826 | 0 | |
| 4380 | 0 | |
| 5218 | 0 | |
| 6670 | 0 | |
| 6754 | 0 |
The 11 blanks in the TotalCharges column are associated with zero-tenure records,
so they can be aptly imputed with 0.
# store record numbers for later checking
ref=df.loc[df["TotalCharges"]==" ",["TotalCharges"]].index
# impute blanks with 0s
df.loc[df["TotalCharges"]==" ",["TotalCharges"]] = 0
# review the imputed fields
df.loc[ref,["TotalCharges"]]
| TotalCharges | |
|---|---|
| 488 | 0 |
| 753 | 0 |
| 936 | 0 |
| 1082 | 0 |
| 1340 | 0 |
| 3331 | 0 |
| 3826 | 0 |
| 4380 | 0 |
| 5218 | 0 |
| 6670 | 0 |
| 6754 | 0 |
2. Data Cleaning and Analysis :
b. Make sure all the variables with continuous values are of ‘Float’ type.
# let's universally convert all numeric values in the dataframe to float type
for col in df.columns:
    df[col]=pd.to_numeric(df[col],errors='ignore',downcast='float')
df.dtypes # review datatypes
customerID object gender object SeniorCitizen float32 Partner object Dependents object tenure float32 PhoneService object MultipleLines object InternetService object OnlineSecurity object OnlineBackup object DeviceProtection object TechSupport object StreamingTV object StreamingMovies object Contract object PaperlessBilling object PaymentMethod object MonthlyCharges float32 TotalCharges float32 Churn object dtype: object
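As a side note, `pd.to_numeric(errors='coerce')` offers a more explicit route than `errors='ignore'`: on the raw data it would have surfaced the 11 blank TotalCharges entries as NaN immediately instead of leaving the column as object. A toy sketch (a stand-in Series, not the project column):

```python
import pandas as pd

# toy stand-in for TotalCharges: numeric strings with one blank entry
s = pd.Series(["29.85", " ", "108.15"])
coerced = pd.to_numeric(s, errors="coerce")    # blanks become NaN instead of silently keeping object dtype
blanks = int(coerced.isna().sum())             # 1 blank surfaced as NaN
cleaned = coerced.fillna(0).astype("float32")  # impute and set the dtype in one pass
```

This collapses the blank-detection and imputation steps into one pass, at the cost of hiding *where* the blanks were unless you inspect the NaNs first.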
2. Data Cleaning and Analysis :
c. Create a function that will accept a DataFrame as input and return pie-charts for all the
appropriate Categorical features. Clearly show percentage distribution in the pie-chart.
d. Share insights for Q2.c.
Before creating such a function, let us temporarily convert the SeniorCitizen column entries to object type (Yes/No);
later, all object columns will be encoded appropriately for model building.
df.loc[df.SeniorCitizen==0,["SeniorCitizen"]]="No"
df.loc[df.SeniorCitizen==1,["SeniorCitizen"]]="Yes"
df.dtypes
customerID object gender object SeniorCitizen object Partner object Dependents object tenure float32 PhoneService object MultipleLines object InternetService object OnlineSecurity object OnlineBackup object DeviceProtection object TechSupport object StreamingTV object StreamingMovies object Contract object PaperlessBilling object PaymentMethod object MonthlyCharges float32 TotalCharges float32 Churn object dtype: object
# since finding distributions of attributes along with a target cross-reference adds more meaning to the visualisation,
# let us add an optional target variable to the pie chart
def dfPie(df,target=None):
    if target is None:  # without target
        cols=df.select_dtypes(include='object').columns  # exclude non-categorical attributes based on datatypes
    else:  # with target
        cols=df.select_dtypes(include='object').columns.drop(target)  # drop target column, as a sunburst over the same column twice is not possible
    for col in cols:  # loop over all columns
        if target is None:  # without target
            fig=px.sunburst(df,path=[col],  # produces pie chart
                            height=600,hover_name=col+":"+df[col],
                            color=df[col],color_discrete_sequence=px.colors.carto.Vivid)
            fig.update_layout(title="Distribution of "+col+" attribute")
        else:  # with target
            fig=px.sunburst(df,path=[col,target],  # produces sunburst chart
                            height=600,hover_name=col+":"+df[col]+"<br>"+target+":"+df[target],
                            color=df[col],color_discrete_sequence=px.colors.carto.Vivid)
            fig.update_layout(title="Distribution of "+col+" attribute with "+target+" rays")
            fig.add_annotation(x=0.75,y=0.52,text=target,showarrow=False,font=dict(size=18))
        # update figure
        fig.update_traces(textinfo="label+percent entry",insidetextorientation='tangential')
        fig.add_annotation(x=0.5,y=0.52,text=col,showarrow=False,font=dict(size=17))
        fig.show()
    if target is not None:  # pie plot for the target column
        fig=px.sunburst(df,path=[target],
                        height=600,hover_name=target+":"+df[target],  # label with target, not the last loop column
                        color=df[target],color_discrete_sequence=px.colors.carto.Vivid)
        fig.update_layout(title="Distribution of "+target+" variable")
        fig.update_traces(textinfo="label+percent entry",insidetextorientation='tangential')
        fig.add_annotation(x=0.5,y=0.52,text=target,showarrow=False,font=dict(size=17))
        fig.show()
dfPie(df.drop("customerID",axis=1),"Churn")
The pie charts above show several distributions;
we will study these patterns alongside the ML tools shortly.
2. Data Cleaning and Analysis :
e. Encode all the appropriate Categorical features with the best suitable approach.
# let's group attributes based on their number of classes
dummyCols=[]
labelCols=[]
for i in df.select_dtypes(include='object').drop('customerID',axis=1).columns:
    if df[i].nunique()==2:  # binary (2-class) attributes (based on the given data, not a generic methodology)
        dummyCols.append(i)
    else:  # more than 2 classes, probably with ordinal data
        labelCols.append(i)
#display(dummyCols,labelCols)
# dummies for binary classes
dums=pd.get_dummies(df[dummyCols],drop_first=True)
dummiedCols=dums.columns
dums.head()
# these dummies are binary class values, hence need not be scaled again (standardising/normalising),
# but there is no harm in scaling them to keep the code simpler
| gender_Male | SeniorCitizen_Yes | Partner_Yes | Dependents_Yes | PhoneService_Yes | PaperlessBilling_Yes | Churn_Yes | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
# let's write down the classes in each of the labelCols
for i in labelCols:
    display(i,list(df[i].unique()))
'MultipleLines'
['No phone service', 'No', 'Yes']
'InternetService'
['DSL', 'Fiber optic', 'No']
'OnlineSecurity'
['No', 'Yes', 'No internet service']
'OnlineBackup'
['Yes', 'No', 'No internet service']
'DeviceProtection'
['No', 'Yes', 'No internet service']
'TechSupport'
['No', 'Yes', 'No internet service']
'StreamingTV'
['No', 'Yes', 'No internet service']
'StreamingMovies'
['No', 'Yes', 'No internet service']
'Contract'
['Month-to-month', 'One year', 'Two year']
'PaymentMethod'
['Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)']
# let's create key-value pairs for the classes
keys = {
'MultipleLines' : {'No phone service':-1, 'No':0, 'Yes':1}, # -1 for No phone service emphasises the ordinal form of the classes
'InternetService' : {'No':0, 'DSL':1, 'Fiber optic':2 }, # FibreOptic is superior to DSL
'OnlineSecurity' : {'No internet service':-1, 'No':0, 'Yes':1},
'OnlineBackup' : {'No internet service':-1, 'No':0, 'Yes':1},
'DeviceProtection' : {'No internet service':-1, 'No':0, 'Yes':1},
'TechSupport' : {'No internet service':-1, 'No':0, 'Yes':1},
'StreamingTV' : {'No internet service':-1, 'No':0, 'Yes':1},
'StreamingMovies' : {'No internet service':-1, 'No':0, 'Yes':1},
'Contract' : {'Month-to-month':0, 'One year':1, 'Two year':2},
'PaymentMethod' : {'Mailed check':0, 'Electronic check':1, 'Bank transfer (automatic)':2, 'Credit card (automatic)':3}
}
# though these are ordinal classes where ranking is significant, the raw integer codes would add arbitrary weights and misguide the model
# hence, these will be standardised with z-scores along with the numerical data to avoid such implicit weightage (like 3x for credit card)
# apply all encoding
# replace dummies
df.drop(dummyCols,axis=1,inplace=True)
df=pd.merge(df,dums,left_index=True,right_index=True)
# apply label encoding
df=df.replace(keys)
# drop customerID for further steps
df.drop("customerID",axis=1,inplace=True)
display(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   tenure                7043 non-null   float32
 1   MultipleLines         7043 non-null   int64
 2   InternetService       7043 non-null   int64
 3   OnlineSecurity        7043 non-null   int64
 4   OnlineBackup          7043 non-null   int64
 5   DeviceProtection      7043 non-null   int64
 6   TechSupport           7043 non-null   int64
 7   StreamingTV           7043 non-null   int64
 8   StreamingMovies       7043 non-null   int64
 9   Contract              7043 non-null   int64
 10  PaymentMethod         7043 non-null   int64
 11  MonthlyCharges        7043 non-null   float32
 12  TotalCharges          7043 non-null   float32
 13  gender_Male           7043 non-null   uint8
 14  SeniorCitizen_Yes     7043 non-null   uint8
 15  Partner_Yes           7043 non-null   uint8
 16  Dependents_Yes        7043 non-null   uint8
 17  PhoneService_Yes      7043 non-null   uint8
 18  PaperlessBilling_Yes  7043 non-null   uint8
 19  Churn_Yes             7043 non-null   uint8
dtypes: float32(3), int64(10), uint8(7)
memory usage: 994.0 KB
None
| tenure | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaymentMethod | MonthlyCharges | TotalCharges | gender_Male | SeniorCitizen_Yes | Partner_Yes | Dependents_Yes | PhoneService_Yes | PaperlessBilling_Yes | Churn_Yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | -1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 29.850000 | 29.850000 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 34.0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 56.950001 | 1889.500000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2.0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 53.849998 | 108.150002 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| 3 | 45.0 | -1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 42.299999 | 1840.750000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2.0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 70.699997 | 151.649994 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
2. Data Cleaning and Analysis :
f. Split the data into 80% train and 20% test.
X=df.drop("Churn_Yes",axis=1)
Y=df["Churn_Yes"]
print("X:",X.shape,"\nY:",Y.shape)# reference dimensions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,test_size=0.20, # split ratio of 80:20
random_state=129) # random seed
# review dimensions after split
print("X_train:",X_train.shape,
"\nX_test:",X_test.shape,
"\nY_train:",Y_train.shape,
"\nY_test:",Y_test.shape)
X: (7043, 19) Y: (7043,) X_train: (5634, 19) X_test: (1409, 19) Y_train: (5634,) Y_test: (1409,)
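Since churn is imbalanced (1869 Yes vs 5174 No), a stratified split is a common refinement worth noting: passing `stratify=y` to `train_test_split` keeps the class ratio identical in both partitions. A sketch on a toy target (toy names, not the project's X/Y):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy imbalanced target: 80 negatives vs 20 positives
y = pd.Series([0] * 80 + [1] * 20)
X = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=129)

# both partitions keep the original 20% positive rate
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```

Without `stratify`, a small test set can end up with a noticeably different churn rate, which skews recall estimates.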
2. Data Cleaning and Analysis :
g. Normalize/Standardize the data with the best suitable approach.
# let us define a function to review the distribution of the training dataset using box plots
def dfBox(df):
    fig = make_subplots(rows=len(df.columns),cols=1)
    n=1
    for i in df.columns:
        fig.add_trace(go.Box(x=df[i],name=i,hovertemplate='%{x}',boxpoints='outliers',
                             jitter=0.7,pointpos=0,boxmean='sd'),n,1)
        n+=1
    fig.update_layout(height=500,showlegend=False)
    fig.show()
dfBox(X_train[["tenure","MonthlyCharges","TotalCharges"]])
Since the data ranges vary widely, scaling is necessary.
Xscl=StandardScaler()
Yscl=StandardScaler()
# scale Train data set
X_train=pd.DataFrame(Xscl.fit_transform(X_train),columns=X_train.columns)
Y_train=pd.Series(Yscl.fit_transform(Y_train.values.reshape(-1,1))[:,0],name=Y_train.name)
# scale test data for scoring purpose
X_test=pd.DataFrame(Xscl.transform(X_test),columns=X_test.columns)
#Y_test=pd.Series(Yscl.transform(Y_test.values.reshape(-1,1))[:,0],name=Y_test.name)
dfBox(X_train[["tenure","MonthlyCharges","TotalCharges"]]) # review scaled train data
Y_train.value_counts()
-0.606745 4118 1.648138 1516 Name: Churn_Yes, dtype: int64
We use StandardScaler because this z-score based method preserves the inherent spread and skew of the dataset,
rather than bounding the values to fixed limits as normalising does;
such behaviour helps with imbalanced datasets.
The same can be observed in the scaled Y_train values: the minority "Yes" class is weighted more in magnitude than "No".
When displaying the results of the classification model, inverse_transform must be applied to the Y values.
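The inverse_transform round trip can be illustrated on a toy target. This standalone sketch (not using the project's `Yscl`) shows that z-scoring a binary label is fully reversible and that the minority class lands further from zero:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([0, 0, 0, 1], dtype=float)      # toy imbalanced binary target
scl = StandardScaler()
z = scl.fit_transform(y.reshape(-1, 1))[:, 0]
# minority class sits further from zero: z ≈ [-0.577, -0.577, -0.577, 1.732]
back = scl.inverse_transform(z.reshape(-1, 1))[:, 0]
recovered = np.allclose(back, y)             # True: the original labels are fully recoverable
```

This is why predictions made in z-score space can be mapped back to 0/1 before computing classification metrics.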
# delete any previous instance of the global variable scoreLog
try:
    del scoreLog
    print("scoreLog deleted")
except:
    print("scoreLog undefined")
# defining a function to report classification metrics
def reporter(Y_train, pred_train, Y_test, pred_test, model_name):
    """Classification report for a 2-class prediction problem
    logs test scores to a global dataframe named scoreLog
    the scoreLog (with any previous scores) will be displayed
    also displays confusion matrices for the current arguments
    ---------------------------------------------------------------------------
    Y_train    ==> TRUE classes used for training (pandas Series or 1-D numpy array)
    pred_train ==> PREDICTION on training data (pandas Series or 1-D numpy array)
    Y_test     ==> TRUE classes used for testing (pandas Series or 1-D numpy array)
    pred_test  ==> PREDICTION on test data (pandas Series or 1-D numpy array)
    model_name ==> str name for the current model, used as index for scoreLog
    ---------------------------------------------------------------------------
    """
    from sklearn import metrics
    import plotly.figure_factory as ff
    import numpy as np
    import pandas as pd
    global scoreLog
    classes=list(Y_test.unique())
    cols=["accuracy"]
    cols.extend(["precision_"+str(c) for c in classes])
    cols.extend(["recall_"+str(c) for c in classes])
    cols.extend(["fscore_"+str(c) for c in classes])
    try:
        type(scoreLog)
    except:
        scoreLog=pd.DataFrame(columns=cols)
    # metrics based on training set
    # confusion matrix
    z=pd.DataFrame(metrics.confusion_matrix(Y_train, pred_train))
    fig1=ff.create_annotated_heatmap(np.array(z),annotation_text=np.array(z),
                                     x=list(np.sort(np.unique(Y_train))),y=list(np.sort(np.unique(Y_train))),
                                     colorscale='Mint',font_colors=['grey','white'],name="TRAINING SET",
                                     hovertemplate="Prediction: %{x:d}<br>True: %{y:d}<br>Count: %{z:d}")
    fig1.update_layout(height=350,width=350)
    fig1.update_xaxes(title_text="PREDICTED (TRAINING SET) - "+model_name)
    fig1.update_yaxes(title_text="TRUE",tickangle=270)
    # scores
    score=[metrics.accuracy_score(Y_train,pred_train)]
    score.extend(metrics.precision_score(Y_train,pred_train,labels=classes,average=None))
    score.extend(metrics.recall_score(Y_train,pred_train,labels=classes,average=None))
    score.extend(metrics.f1_score(Y_train,pred_train,labels=classes,average=None))
    scoreLog=scoreLog.append(pd.DataFrame(score,index=cols,columns=[model_name+"_training"]).T)
    # metrics based on test set
    # confusion matrix
    z=pd.DataFrame(metrics.confusion_matrix(Y_test, pred_test))
    fig2=ff.create_annotated_heatmap(np.array(z),annotation_text=np.array(z),
                                     x=list(np.sort(np.unique(Y_test))),y=list(np.sort(np.unique(Y_test))),
                                     colorscale='Mint',font_colors=['grey','white'],name="TEST SET",
                                     hovertemplate="Prediction: %{x:d}<br>True: %{y:d}<br>Count: %{z:d}")
    fig2.update_layout(height=350,width=350)
    fig2.update_xaxes(title_text="PREDICTED (TEST SET) - "+model_name)
    fig2.update_yaxes(title_text="TRUE",tickangle=270)
    # scores
    score=[metrics.accuracy_score(Y_test,pred_test)]
    score.extend(metrics.precision_score(Y_test,pred_test,labels=classes,average=None))
    score.extend(metrics.recall_score(Y_test,pred_test,labels=classes,average=None))
    score.extend(metrics.f1_score(Y_test,pred_test,labels=classes,average=None))
    scoreLog=scoreLog.append(pd.DataFrame(score,index=cols,columns=[model_name+"_test"]).T)
    # merge both confusion matrix heatplots into one figure
    fig=make_subplots(rows=1,cols=2,horizontal_spacing=0.05)
    fig.add_trace(fig1.data[0],row=1,col=1)
    fig.add_trace(fig2.data[0],row=1,col=2)
    annot1=list(fig1.layout.annotations)
    annot2=list(fig2.layout.annotations)
    for k in range(len(annot2)):
        annot2[k]['xref']='x2'
        annot2[k]['yref']='y2'
    fig.update_layout(annotations=annot1+annot2)
    fig.layout.xaxis.update(fig1.layout.xaxis)
    fig.layout.yaxis.update(fig1.layout.yaxis)
    fig.layout.xaxis2.update(fig2.layout.xaxis)
    fig.layout.yaxis2.update(fig2.layout.yaxis)
    fig.layout.yaxis2.update({'title': {'text': ''}})
    display(scoreLog)
    fig.show()
scoreLog undefined
3. Model building and Improvement:
a. Train a model using XGBoost and use RandomizedSearchCV to train on best parameters.
Also print best performing parameters along with train and test performance.
# let's build a base model using XGBoost
xgcl=xgb.XGBClassifier(objective='reg:logistic', # for classification decision making
colsample_bytree=0.3, # percentage of features used per tree
subsample=0.3, # percentage of samples used per tree
learning_rate=0.1, # feature weight scaling to control overfitting
max_depth=5, # allowed depth for each tree
alpha=10, # regularisation
n_estimators=10, # number of trees
)
xgcl.fit(X_train,Y_train)
pred_train=Yscl.inverse_transform(xgcl.predict(X_train))
pred_test=Yscl.inverse_transform(xgcl.predict(X_test))
reporter(Yscl.inverse_transform(Y_train),pred_train,Y_test,pred_test,"XGBbase")
| accuracy | precision_0 | precision_1 | recall_0 | recall_1 | fscore_0 | fscore_1 | |
|---|---|---|---|---|---|---|---|
| XGBbase_training | 0.790735 | 0.806849 | 0.699408 | 0.938320 | 0.389842 | 0.867632 | 0.500635 |
| XGBbase_test | 0.798439 | 0.814845 | 0.688525 | 0.946023 | 0.356941 | 0.875548 | 0.470149 |
# let us now perform hyperparameter tuning using RandomizedSearchCV
xgcl_param_dist={'colsample_bytree': np.arange(0.3, 1.0, 0.1),
'subsample': np.arange(0.2, 1.0, 0.05),
'learning_rate': np.arange(0.05,0.5,0.05),
'max_depth': np.arange(2,21,2),
'alpha': np.arange(5,25,2),
'n_estimators': np.arange(10,500,10),
'objective' : ['reg:logistic']}
xgcl_rand_tuned=RandomizedSearchCV(estimator=xgcl,param_distributions=xgcl_param_dist,
n_iter=50,cv=10,scoring="recall")
st1=time.process_time()
xgcl_rand_tuned.fit(X_train,Y_train)
print("time taken : %.6f"%((time.process_time()-st1)))
time taken : 2390.281250
pred_train=Yscl.inverse_transform(xgcl_rand_tuned.predict(X_train))
pred_test=Yscl.inverse_transform(xgcl_rand_tuned.predict(X_test))
reporter(Yscl.inverse_transform(Y_train),pred_train,Y_test,pred_test,"Random_tuned")
| accuracy | precision_0 | precision_1 | recall_0 | recall_1 | fscore_0 | fscore_1 | |
|---|---|---|---|---|---|---|---|
| XGBbase_training | 0.790735 | 0.806849 | 0.699408 | 0.938320 | 0.389842 | 0.867632 | 0.500635 |
| XGBbase_test | 0.798439 | 0.814845 | 0.688525 | 0.946023 | 0.356941 | 0.875548 | 0.470149 |
| Random_tuned_training | 0.834576 | 0.861552 | 0.737785 | 0.921807 | 0.597625 | 0.890662 | 0.660350 |
| Random_tuned_test | 0.813343 | 0.858824 | 0.648026 | 0.898674 | 0.558074 | 0.878297 | 0.599696 |
Based on the scoreLog table, the tuned model's performance has clearly increased in terms of recall (recall was the scoring parameter for the search).
Let's also print the best parameters below.
xgcl_rand_tuned.best_params_  # best parameters found by RandomizedSearchCV
{'subsample': 0.49999999999999994,
'objective': 'reg:logistic',
'n_estimators': 220,
'max_depth': 6,
'learning_rate': 0.15000000000000002,
'colsample_bytree': 0.9000000000000001,
'alpha': 15}
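A note on reusing the searched parameters: with `refit=True` (the default), the fitted searcher already holds a model retrained on the full training data with `best_params_`, so no manual rebuild is needed. The sketch below demonstrates this with a lightweight stand-in (a DecisionTree on synthetic data, rather than the project's XGBoost setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# toy stand-in search, so the sketch runs without extra dependencies
X_toy, y_toy = make_classification(n_samples=200, random_state=0)
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": [2, 4, 6]},
    n_iter=3, cv=3, scoring="recall", random_state=0)
search.fit(X_toy, y_toy)

# best_estimator_ is already retrained with best_params_ (refit=True default)
best_model = search.best_estimator_
preds = best_model.predict(X_toy[:5])
```

In this project, `xgcl_rand_tuned.predict(...)` therefore already delegates to the refit best estimator.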
3. Model building and Improvement:
b. Train a model using XGBoost and use GridSearchCV to train on best parameters.
Also print best performing parameters along with train and test performance.
# since we have the base model xgcl, let us hyper-tune with GridSearchCV and compare
# let us create a param_grid for the search
# considering the excessive computational cost of grid search,
# let's limit the grid to fewer than 100 combinations, possibly with fewer hyperparameters as well
xgcl_param_grid={'colsample_bytree': [0.3,0.5],
'learning_rate': [0.05,0.1,0.25,0.4],
'max_depth': [12,15,18],
'n_estimators': [350,375,400],
'subsample': [0.25],
'alpha' : [23],
'objective' : ['reg:logistic']}
xgcl_grid_tuned=GridSearchCV(estimator=xgcl,param_grid=xgcl_param_grid,
cv=10,scoring="recall")
st1=time.process_time()
xgcl_grid_tuned.fit(X_train,Y_train)
print("time taken : %.6f"%((time.process_time()-st1)))
time taken : 3445.093750
pred_train=Yscl.inverse_transform(xgcl_grid_tuned.predict(X_train))
pred_test=Yscl.inverse_transform(xgcl_grid_tuned.predict(X_test))
reporter(Yscl.inverse_transform(Y_train),pred_train,Y_test,pred_test,"Grid_tuned")
| accuracy | precision_0 | precision_1 | recall_0 | recall_1 | fscore_0 | fscore_1 | |
|---|---|---|---|---|---|---|---|
| XGBbase_training | 0.790735 | 0.806849 | 0.699408 | 0.938320 | 0.389842 | 0.867632 | 0.500635 |
| XGBbase_test | 0.798439 | 0.814845 | 0.688525 | 0.946023 | 0.356941 | 0.875548 | 0.470149 |
| Random_tuned_training | 0.834576 | 0.861552 | 0.737785 | 0.921807 | 0.597625 | 0.890662 | 0.660350 |
| Random_tuned_test | 0.813343 | 0.858824 | 0.648026 | 0.898674 | 0.558074 | 0.878297 | 0.599696 |
| Grid_tuned_training | 0.810437 | 0.845805 | 0.683007 | 0.905780 | 0.551451 | 0.874765 | 0.610219 |
| Grid_tuned_test | 0.811214 | 0.855216 | 0.646465 | 0.900568 | 0.543909 | 0.877306 | 0.590769 |
xgcl_grid_tuned.best_params_ # best parameters found by GridSearchCV
{'alpha': 23,
'colsample_bytree': 0.3,
'learning_rate': 0.05,
'max_depth': 12,
'n_estimators': 350,
'objective': 'reg:logistic',
'subsample': 0.25}
The scoreLog table above lists the performance scores of all three models on both the training and test sets.
RandomizedSearchCV clearly gets closer to the best parameters than GridSearchCV does here.
RandomizedSearchCV tuning achieved a test recall of 0.55 for Churn_Yes and a test accuracy of 0.81,
improving from 0.35 and 0.79 respectively for the base XGBoost model.
======================================================================================================
• DOMAIN: IT
• CONTEXT: The purpose is to build a machine learning pipeline that will work autonomously irrespective of Data and users can save efforts involved in building pipelines for each dataset.
• PROJECT OBJECTIVE: Build a machine learning pipeline that will run autonomously with the csv file and return best performing model.
• STEPS AND TASKS
Include best coding practices in the code:
• Modularization
• Maintainability
• Well-commented code, etc.
def imports():
"""Import necessary packages globally
other prerequisite settings can be added (like warning ignore)"""
global pd,np,re,time,warnings,traceback, pickle
global px,make_subplots,go,ff
global train_test_split, metrics
global StandardScaler, StratifiedKFold
global XGBClassifier, RandomForestClassifier
global LogisticRegression, KNeighborsClassifier
global GaussianNB, SVC, DecisionTreeClassifier
global AdaBoostClassifier, GradientBoostingClassifier
global BaggingClassifier, SMOTE, cross_val_score
import pandas as pd
import numpy as np
from plotly import express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff
import re
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
import time
import warnings
import traceback
import pickle
pd.set_option('display.max_columns', None) # display all columns without truncating
warnings.filterwarnings('ignore') # mute warnings
debug=True # set to False for 0 verbose
from inspect import getframeinfo, currentframe
def linepass():
"""debugging assist"""
if debug: print("Reached %s"%(currentframe().f_back.f_code.co_name),"method, line",str(getframeinfo(currentframe().f_back).lineno))
class data():
"""encapsulate data fetching, cleaning"""
# Constructor method
def __init__(self,files,path="",on=None,how='outer',drop_on=False,target=None):
""" read csv from path. if list of files passed on,
read the csv files from path and merge using 'on' columns and how method
read dataframe will be cleaned for further processing
files : filename as str or list of file names as str
path : <default ""> path to file directory as str
if "", files will be searched in same directory as code
on : <default None> in case of multiple files, column name for merge,
if None, common columns will be used by pandas merge
how : <default outer> refer pandas merge types
drop_on : <default False> if True, 'on' will be dropped for deletion of index
target : <default None> target column name as str,
if None, last column in dataframe will be considered as Target"""
linepass() # debug info
# record instance name
(filename,line_number,function_name,text)=traceback.extract_stack()[-2]
self.name=text[:text.find("=")].strip()
# dataframe initiation
self.df=pd.DataFrame()
# read files
self.reader(files,path,on,how,drop_on,target)
def reader(self,files,path="",on=None,how='outer',drop_on=False,target=None):
linepass() # debug info
#read sets of files
if (type(files)==list and len(files)>1):
self.df=pd.read_csv(path+files[0])
for i in range(1,len(files)):
subdf=pd.read_csv(path+files[i])
self.df=pd.merge(self.df,subdf,on=on,how=how)
else: # read a single file
self.df=pd.read_csv(path+files)
if drop_on: # drop merger columns
self.df.drop(on,axis=1,inplace=True)
if target==None: # fix target feature
self.target=self.df.columns[-1]
elif not isinstance(target,str):
raise ValueError("Target must be a single feature name given as str")
else:
self.target=target
# check for unlabelled target values & drop those records
keys={" ":float("NaN"),
"-":float("NaN"),
"NA":float("NaN"),
"N/A":float("NaN")}
if debug: display(self.df[self.target].value_counts())
self.df[self.target].replace(keys,inplace=True)
self.df.dropna(subset=[self.target],inplace=True)
# append to data class
data.reader=reader
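The multi-file branch of `reader` boils down to `pd.merge` with an `on` key and a `how` method. A minimal sketch of the outer merge it performs, on two hypothetical toy frames standing in for the churn CSVs:

```python
import pandas as pd

# Toy stand-ins for the two churn CSVs; the customerID values are made up.
df1 = pd.DataFrame({'customerID': ['A', 'B', 'C'], 'tenure': [1, 24, 72]})
df2 = pd.DataFrame({'customerID': ['A', 'B', 'D'], 'Churn': ['No', 'Yes', 'No']})

# how='outer' keeps the union of keys, so rows missing from one file survive
# with NaN in that file's columns.
merged = pd.merge(df1, df2, on='customerID', how='outer')
print(merged.shape)  # (4, 3): union of keys A, B, C, D
```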
def splitter(self):
"""split the dataset into training & testing sets
default split ratio Training:Testing::80:20"""
linepass() # debug info
L=self.df.shape[0]
# seed pseudo random generator
np.random.seed(129)
indices=np.random.choice(range(L),L,False)
s1=int(np.floor(L*0.8))
# split dataframe
d1=self.df.iloc[indices[:s1]]
d2=self.df.iloc[indices[s1:]]
# create training dataset
self.X_train=d1.drop(self.target,axis=1).copy()
self.Y_train=d1[self.target].copy()
# create test dataset
self.X_test=d2.drop(self.target,axis=1).copy()
self.Y_test=d2[self.target].copy()
return self.X_train, self.Y_train, self.X_test, self.Y_test
# append to data class
data.splitter=splitter
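The split in `splitter` is a seeded index permutation rather than `train_test_split`. A minimal sketch of the same mechanics on a toy frame, assuming the same seed of 129:

```python
import numpy as np
import pandas as pd

# Toy frame; any 10-row frame shows the 80:20 cut.
df = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

L = len(df)
np.random.seed(129)                          # same seed as splitter
idx = np.random.choice(range(L), L, False)   # a permutation: no repeats
s1 = int(np.floor(L * 0.8))                  # 80% boundary

train, test = df.iloc[idx[:s1]], df.iloc[idx[s1:]]
print(len(train), len(test))  # 8 2
```

Seeding before the permutation makes the split reproducible across runs, which is what keeps the later test-set evaluation honest.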
# data cleaning
def fit_transform(self,X):
"""perform basic, non-subjective data cleaning
X : dataset to be processed
Note that any domain specific data preparation not incorporated"""
self.fit(X)
X=self.transform(X)
return X
# fitting
def fit(self,X):
"""collect all necessary checkpoints for preprocessing
X : dataset to be fit
Note that any domain specific data preparation not incorporated"""
linepass() # debug info
# methods
self.uniqueKey(X)
self.uniformKey(X)
self.nullnans_replacement(X)
# register fitting
self.fitted=True
# implementation
def transform(self,X):
"""implement cleaning based on fit information
Note that any domain specific data preparation not incorporated
X : dataset to be transformed"""
linepass() # debug info
# verify fitting
if not self.fitted:
raise ValueError("{0} not fit\ncall {0}.fit() before transforming".format(self.name))
# methods
X=self.uniqueKey_drop(X)
X=self.uniformKey_drop(X)
X=self.nullnans_impute(X)
return X
# append to data class
data.fit_transform=fit_transform
data.fit=fit
data.transform=transform
def uniqueKey(self,X):
"""find record-ID columns (unique Key)"""
linepass() # debug info
self.unique_key=[]
L=X.shape[0]
for i in X.select_dtypes(include='object').columns:
if X[i].nunique()==L: # all different values
self.unique_key.append(i)
if len(self.unique_key)>0:
if debug: print('Unique Key features identified\n',self.unique_key)
# append to data class
data.uniqueKey=uniqueKey
def uniqueKey_drop(self,X):
"""drop unique Key columns
because they dont add any meaningful relation to classification target"""
linepass() # debug info
if len(self.unique_key)>0:
X=X.drop(self.unique_key,axis=1).copy()
if debug: print('Unique key features dropped',self.unique_key)
return X
# append to data class
data.uniqueKey_drop=uniqueKey_drop
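The rule in `uniqueKey` is simple: any object column whose number of unique values equals the row count is treated as a record ID. A sketch on a toy frame (the ID values are made up):

```python
import pandas as pd

# Toy frame: customerID is unique per row, Contract is not.
df = pd.DataFrame({'customerID': ['7590-VHVEG', '5575-GNVDE', '3668-QPYBK'],
                   'Contract': ['Month-to-month', 'One year', 'Month-to-month']})

L = len(df)
# same rule as uniqueKey: object column with nunique == row count
unique_key = [c for c in df.select_dtypes(include='object').columns
              if df[c].nunique() == L]
print(unique_key)  # ['customerID']
```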
def uniformKey(self,X):
"""find single value columns"""
linepass() # debug info
self.uniform_key=[]
for i in X.columns:
if X[i].nunique()==1: # all same values
self.uniform_key.append(i)
if len(self.uniform_key)>0:
if debug: print('Single value features identified\n',self.uniform_key)
# append to data class
data.uniformKey=uniformKey
def uniformKey_drop(self,X):
"""drop single valued columns
because they dont add any meaningful relation to classification target"""
linepass() # debug info
if len(self.uniform_key)>0:
X=X.drop(self.uniform_key,axis=1).copy()
if debug: print("SingleValue columns dropped",self.uniform_key)
return X
# append to data class
data.uniformKey_drop=uniformKey_drop
def nullnans_replacement(self,X):
"""identifies replacement values to be used in place of NULLs/NANs"""
linepass() # debug info
# most common unexpected values
keys={" ":float("NaN"),
"-":float("NaN"),
"NA":float("NaN"),
"N/A":float("NaN")}
replacements=pd.DataFrame(columns=X.columns)
# create reference copy of data
temp=X.replace(keys).copy()
# find replacement values
for col in temp.columns:
temp[col]=pd.to_numeric(temp[col],errors='ignore')
if re.search('obj',str(temp.dtypes[col]))!=None:
replacements.loc[0,col]=temp[col].mode()[0] # store mode for categorical features
else:
replacements.loc[0,col]=temp[col].mean() # store mean for numeric features
# convert replacements to a dictionary
self.replacements=dict()
for col in replacements.columns:
self.replacements[col]=replacements.loc[0,col]
# append to data class
data.nullnans_replacement=nullnans_replacement
def nullnans_impute(self,X):
"""impute unexpected values in the records"""
linepass() # debug info
# most common unexpected values
keys={" ":float("NaN"),
"-":float("NaN"),
"NA":float("NaN"),
"N/A":float("NaN")}
# replace unexpected values
X=X.replace(keys).copy()
# impute NANs
X=X.fillna(self.replacements).copy()
# convert all numeric features in the dataframe to numeric type int64 or float64
for col in X.columns:
X[col]=pd.to_numeric(X[col],errors='ignore')
return X
# append to data class
data.nullnans_impute=nullnans_impute
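The imputation is a two-step replace-then-fillna: placeholder strings become NaN, then each column is filled with the mean (numeric) or mode (categorical) recorded at fit time. A sketch on toy values:

```python
import pandas as pd

# Toy frame with the two placeholder styles the cleaner targets.
df = pd.DataFrame({'TotalCharges': ['29.85', ' ', '108.15'],
                   'Contract': ['One year', 'NA', 'One year']})
keys = {' ': float('NaN'), '-': float('NaN'),
        'NA': float('NaN'), 'N/A': float('NaN')}

df = df.replace(keys)                                 # placeholders -> NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])
replacements = {'TotalCharges': df['TotalCharges'].mean(),  # numeric: mean
                'Contract': df['Contract'].mode()[0]}       # categorical: mode
df = df.fillna(replacements)
print(df.loc[1, 'TotalCharges'], df.loc[1, 'Contract'])  # 69.0 One year
```

In the pipeline the replacement values come from the training split only, so the test set is imputed without leaking its own statistics.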
class preprocessing():
"""prepare cleaned dataset for machine learning
encode, scale"""
def __init__(self):
"""initialise the preprocessor"""
linepass() # debug info
# record instance name
(filename,line_number,function_name,text)=traceback.extract_stack()[-2]
self.name=text[:text.find("=")].strip()
self.scaler = StandardScaler()
self.fitted=False # set True by fit_transform
def fit(self,X):
"""dummy keys call"""
linepass() # debug info
self.encoder_keys(X)
# append to preprocessing class
preprocessing.fit=fit
def transform(self,X):
"""encode & scale"""
linepass() # debug info
# verify fitting
if not self.fitted:
raise ValueError("{0} not fit\ncall {0}.fit_transform() before re-transforming".format(self.name))
#encoding
X = self.encoder(X)
#scaling
X=pd.DataFrame(self.scaler.transform(X),columns=X.columns)
return X
# append to preprocessing class
preprocessing.transform=transform
def fit_transform(self,X):
"""fit & transform preprocessors"""
linepass() # debug info
#encoding
self.encoder_keys(X)
X = self.encoder(X)
#scaling
self.scaler.fit(X)
X=pd.DataFrame(self.scaler.transform(X),columns=X.columns)
self.fitted=True
return X
# append to preprocessing class
preprocessing.fit_transform=fit_transform
def encoder_keys(self,X):
"""dummy encoder keys for all categorical features"""
linepass() # debug info
self.dummy_cols=X.select_dtypes(include='object').columns
# append to preprocessor class
preprocessing.encoder_keys=encoder_keys
def encoder(self,X):
"""dummy encoder"""
linepass() # debug info
if len(self.dummy_cols)>0:
dums=pd.get_dummies(X[self.dummy_cols],drop_first=True) # generate dummies
X=X.drop(self.dummy_cols,axis=1).copy() # drop original features
X=X.merge(dums,left_index=True,right_index=True).copy() # merge dummified features
return X
# append to preprocessor class
preprocessing.encoder=encoder
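`encoder` relies on `pd.get_dummies` with `drop_first=True`, which keeps k-1 indicator columns per categorical feature and so avoids a perfectly collinear column. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame with one categorical and one numeric feature.
X = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year'],
                  'tenure': [1, 12, 60]})

dummy_cols = X.select_dtypes(include='object').columns
dums = pd.get_dummies(X[dummy_cols], drop_first=True)  # k-1 dummies per feature
X = X.drop(dummy_cols, axis=1).merge(dums, left_index=True, right_index=True)
print(list(X.columns))  # ['tenure', 'Contract_One year', 'Contract_Two year']
```

The dropped first category ('Month-to-month' here) is implied whenever all remaining dummies are zero.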
class classifiers(preprocessing):
""" Machine Learning Classifiers"""
def __init__(self):
"""initiate classification learners"""
linepass() # debug info
# record instance name
(filename,line_number,function_name,text)=traceback.extract_stack()[-2]
self.name=text[:text.find("=")].strip()
# base models for classification problems
self.models = [ XGBClassifier(objective='reg:logistic',eval_metric='mlogloss'),
DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1),
BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1), n_estimators=50,random_state=1),
AdaBoostClassifier(n_estimators=10, random_state=1),
GradientBoostingClassifier(n_estimators = 50,random_state=1),
RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12),
LogisticRegression(solver="liblinear"),
GaussianNB(),
KNeighborsClassifier(n_neighbors= 5 , weights = 'distance' ),
SVC(gamma=0.025, C=3)
]
def scorer(self,X_train,Y_train):
"""evaluate all available models on the given dataset & find the best model"""
linepass() # debug info
self.scoreLog=pd.DataFrame()
# cross-validation shuffler
kfold = StratifiedKFold(n_splits=10, random_state=129, shuffle=True)
# assign scoring to recall score of "Yes" label in Y_train
recall=metrics.make_scorer(metrics.recall_score,pos_label="Yes")
for clf in self.models:
st1=time.process_time()
#cross validation scoring
cvres=cross_val_score(clf, X_train, Y_train, cv=kfold, scoring=recall)
t_taken=(time.process_time()-st1)/10 # for 10 kFolds
self.scoreLog.loc[str(clf)[:str(clf).find('(')],["score"]]=cvres.mean()
self.scoreLog.loc[str(clf)[:str(clf).find('(')],["time"]]=t_taken
# best performing model selection
self.scoreLog.loc[self.scoreLog.sort_values(by="score",ascending=False).index,"score_rank"]=range(1,11,1)
self.scoreLog.loc[self.scoreLog.sort_values(by="time",ascending=True).index,"time_rank"]=range(1,11,1)
self.scoreLog["combined"]=self.scoreLog.loc[:,"score_rank"]+self.scoreLog.loc[:,"time_rank"]
# choice based on best performance in least time
self.choice=self.scoreLog.sort_values(by="combined",ascending=True).index[0]
for clf in self.models:
if str(clf)[:str(clf).find('(')]==self.choice:
self.best_model=clf
def fit(self,X_train,Y_train):
"""fit all models available"""
linepass() # debug info
# evaluate all models
self.scorer(X_train,Y_train)
#fit the best model
self.best_model.fit(X_train, Y_train)
self.fitted=True
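The selection rule in `scorer` can be isolated from the models themselves: rank by score descending, rank by time ascending, and pick the model with the smallest rank sum. A sketch with hypothetical scores and timings:

```python
import pandas as pd

# Hypothetical cross-validation scores and per-fold timings.
scoreLog = pd.DataFrame({'score': [0.84, 0.85, 0.70],
                         'time': [0.5, 9.0, 2.0]},
                        index=['RandomForest', 'SVC', 'LogisticRegression'])

scoreLog['score_rank'] = scoreLog['score'].rank(ascending=False)  # best score = 1
scoreLog['time_rank'] = scoreLog['time'].rank(ascending=True)     # fastest = 1
scoreLog['combined'] = scoreLog['score_rank'] + scoreLog['time_rank']
choice = scoreLog['combined'].idxmin()  # best performance in least time
print(choice)  # RandomForest
```

Note the trade-off this encodes: SVC scores highest but its slowness costs it the combined rank, so a slightly weaker yet much faster model wins.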
def pickling(fname,cleaner,preprocessor,classifier,classifier_name):
"""save trained model(s) to hard disk"""
linepass() # debug info
# save the model to the local directory
pickle.dump((cleaner,preprocessor,classifier,classifier_name), open(fname, 'wb'))
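`pickling` serialises the whole (cleaner, preprocessor, classifier, name) tuple in a single `pickle.dump` call. The round trip can be sketched with an in-memory buffer and a plain dict standing in for the fitted objects:

```python
import io
import pickle

# Dict and string stand in for the fitted cleaner/model objects.
bundle = ({'scaler': 'fitted'}, 'DecisionTreeClassifier')

buf = io.BytesIO()           # in-memory file, no disk write needed for the demo
pickle.dump(bundle, buf)
buf.seek(0)
cleaner, clf_name = pickle.load(buf)  # unpacks in the same order it was dumped
print(clf_name)  # DecisionTreeClassifier
```

Dumping all the fitted stages together guarantees the prediction path later reloads a mutually consistent set of transformers and model.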
def trainer(files,path="",on=None,how='outer',drop_on=False,target=None):
"""ML classifier solver"""
linepass() # debug info
#invoke data class object
df=data(files,path,on,how,drop_on,target)
global X_train, Y_train, X_test, Y_test
X_train, Y_train, X_test, Y_test = df.splitter()
X_train=df.fit_transform(X_train)
# invoke preprocessing class
prep = preprocessing()
X_train=prep.fit_transform(X_train)
# balance training data set
balancer = SMOTE(sampling_strategy='not majority', random_state=129)
X_train, Y_train = balancer.fit_resample(X_train,Y_train)
# tryout various base models
model=classifiers()
model.fit(X_train, Y_train)
# save the trained model to local disk
pickling('best_pickle.har',df,prep,model.best_model,model.choice)
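`SMOTE(sampling_strategy='not majority')` synthesises minority-class rows until every class matches the majority count. The before/after class counts can be sketched on a toy label vector (SMOTE itself interpolates new feature rows; only the counts are shown here):

```python
import pandas as pd

# Toy imbalanced target, roughly mimicking the churn class split.
y = pd.Series(['No'] * 80 + ['Yes'] * 20)

before = y.value_counts().to_dict()
target_count = y.value_counts().max()            # majority class size
after = {label: target_count for label in before}  # every class resampled up
print(before, after)  # {'No': 80, 'Yes': 20} {'No': 80, 'Yes': 80}
```

Balancing only the training split (as `trainer` does, after the split) keeps the test set at its natural class ratio.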
def main(files,path="",on=None,how='outer',drop_on=False,target=None):
"""train & predict for a classification problem"""
imports()
global X_test, Y_test
# training
trainer(files,path,on,how,drop_on,target)
# load the model from disk
m1,m2,clf,clf_name = pickle.load(open('best_pickle.har', 'rb'))
X_test=m1.transform(X_test)
X_test=m2.transform(X_test)
pred = clf.predict(X_test)
# display scores
print(clf_name.upper()+" MODEL")
print("ACCURACY SCORE",metrics.accuracy_score(Y_test,pred))
print("\nCLASSIFICATION METRICS\n",metrics.classification_report(Y_test,pred))
#confusion matrix
z=pd.DataFrame(metrics.confusion_matrix(Y_test, pred))
fig1=ff.create_annotated_heatmap(np.array(z),annotation_text=np.array(z),
x=list(np.sort(np.unique(Y_test))),y=list(np.sort(np.unique(Y_test))),
colorscale='Mint',font_colors = ['grey','white'],name=clf_name,
hovertemplate="Count: %{z:d}")
fig1.update_layout(height=350,width=350)
fig1.update_xaxes(title_text=clf_name.upper()+" CONFUSION MATRIX<br><br>PREDICTED")
fig1.update_yaxes(title_text="TRUE",tickangle=270)
fig1.show()
debug=False
# invoke main function
files = ["TelcomCustomer-Churn_1.csv","TelcomCustomer-Churn_2.csv"]
path = r"C:\Users\HARI SAMYNAATH S\Anaconda_workspace\GLAIML_course\04 - Ensemble Technique\\"
on = "customerID"
target = "Churn"
main(files,path,on,'outer',False,target)
DECISIONTREECLASSIFIER MODEL
ACCURACY SCORE 0.6408800567778566
CLASSIFICATION METRICS
precision recall f1-score support
No 0.94 0.53 0.68 1013
Yes 0.43 0.92 0.59 396
accuracy 0.64 1409
macro avg 0.69 0.72 0.64 1409
weighted avg 0.80 0.64 0.66 1409
An ML pipeline was established.
The best model was identified and pickled for production readiness.
Test data was treated equivalently to production data, with no data leaks.
The model was chosen based on the recall score for the Churn=Yes class.
Interestingly, the model's accuracy is poor even though its recall is at its peak.
This is not a chance result, as a 10-fold cross-validation score was referenced when ranking the models;
still, further evaluation methods could be implemented in future to build a more trustworthy model.
Follow-up action needed: